Abstract



Which ads should we display in sponsored search in order to maximize our rev- 
enue? How should we dynamically rank information sources to maximize value 
of information? These applications exhibit strong diminishing returns: Selection 
of redundant ads and information sources decreases their marginal utility. We 
show that these and other problems can be formalized as repeatedly selecting an 
assignment of items to positions to maximize a sequence of monotone submodular 
functions that arrive one by one. We present an efficient algorithm for this general 
problem and analyze it in the no-regret model. Our algorithm possesses strong 
theoretical guarantees, such as a performance ratio that converges to the optimal 
constant of 1 - 1/e. We empirically evaluate our algorithm on two real-world
online optimization problems on the web: ad allocation with submodular utilities, 
and dynamically ranking blogs to detect information cascades. 

1 Introduction 

Consider the problem of repeatedly choosing advertisements to display in sponsored search to 
maximize our revenue. In this problem, there is a small set of positions on the page, and each time 
a query arrives we would like to assign, to each position, one out of a large number of possible 
ads. In this and related problems that we call online assignment learning problems, there is a set 
of positions, a set of items, and a sequence of rounds, and on each round we must assign an item 
to each position. After each round, we obtain some reward depending on the selected assignment, 
and we observe the value of the reward. When there is only one position, this problem becomes 
the well-studied multiarmed bandit problem [2]. When the positions have a linear ordering, the
assignment can be construed as a ranked list of elements, and the problem becomes one of selecting 
lists online. Online assignment learning thus models a central challenge in web search, sponsored 
search, news aggregators, and recommendation systems, among other applications. 

A common assumption made in previous work on these problems is that the quality of an assignment 
is the sum of a function on the (item, position) pairs in the assignment. For example, online advertising 
models with click-through-rates [6] make an assumption of this form. More recently, there have been
attempts to incorporate the value of diversity in the reward function [16]. Intuitively, even though the
best K results for the query "turkey" might happen to be about the country, the best list of K results 
is likely to contain some recipes for the bird as well. This will be the case if there are diminishing 
returns on the number of relevant links presented to a user; for example, if it is better to present each 
user with at least one relevant result than to present half of the users with no relevant results and 
half with two relevant results. We incorporate these considerations in a flexible way by providing 
an algorithm that performs well whenever the reward for an assignment is a monotone submodular
function of its set of (item, position) pairs. 

Our key contributions are: (i) an efficient algorithm, TabularGreedy, that provides a constant
factor (1 - 1/e) approximation to the problem of optimizing assignments under submodular utility
functions, (ii) an algorithm for online learning of assignments, TGbandit, that has strong
performance guarantees in the no-regret model, and (iii) an empirical evaluation on two problems of
information gathering on the web.



2 The assignment learning problem 



We consider problems where we have K positions (e.g., slots for displaying ads), and need to assign
to each position an item (e.g., an ad) in order to maximize a utility function (e.g., the revenue from 
clicks on the ads). We address both the offline problem, where the utility function is specified in 
advance, and the online problem, where a sequence of utility functions arrives over time, and we need 
to repeatedly select a new assignment. 



The Offline Problem. In the offline problem we are given sets $P_1, P_2, \ldots, P_K$, where $P_k$ is the
set of items that may be placed in position k. We assume without loss of generality that these sets
are disjoint.^1 An assignment is a subset $S \subseteq V$, where $V = P_1 \cup P_2 \cup \cdots \cup P_K$ is the set of all
items. We call an assignment feasible if at most one item is assigned to each position (i.e., for all k,
$|S \cap P_k| \le 1$). We use $\mathcal{P}$ to refer to the set of feasible assignments.

Our goal is to find a feasible assignment maximizing a utility function $f : 2^V \to \mathbb{R}_{\ge 0}$. As we discuss
later, many important assignment problems satisfy submodularity, a natural diminishing returns
property: Assigning a new item to a position k increases the utility more if few elements have been
assigned yet, and less if many items have already been assigned. Formally, a utility function f is
called submodular if for all $S \subseteq S'$ and $s \notin S'$ it holds that $f(S \cup \{s\}) - f(S) \ge f(S' \cup \{s\}) - f(S')$.
We will also assume f is monotone (i.e., for all $S \subseteq S'$, we have $f(S) \le f(S')$). Our goal is thus,
for a given non-negative, monotone and submodular utility function f, to find a feasible assignment
$S^*$ of maximum utility, $S^* = \arg\max_{S \in \mathcal{P}} \{f(S)\}$.
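The two definitions above can be checked by brute force on a tiny ground set. The following is a minimal sketch, not part of the original paper: the helper names and the coverage utility (`COVERS`, `coverage`) are hypothetical, chosen only to illustrate a function that passes the test and one (`squared_size`) that fails it.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone_submodular(f, ground_set, tol=1e-12):
    """Exhaustively test f(S) <= f(S') and diminishing returns for all S within S'."""
    ground_set = list(ground_set)
    all_sets = list(subsets(ground_set))
    for small in all_sets:
        for big in all_sets:
            if not small <= big:
                continue
            if f(small) > f(big) + tol:            # monotonicity violated
                return False
            for s in ground_set:
                if s in big:
                    continue
                gain_small = f(small | {s}) - f(small)
                gain_big = f(big | {s}) - f(big)
                if gain_small < gain_big - tol:    # diminishing returns violated
                    return False
    return True

# Hypothetical coverage utility: each item "covers" a set of users.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3}}

def coverage(S):
    covered = set()
    for item in S:
        covered |= COVERS[item]
    return len(covered)

def squared_size(S):
    # Monotone but not submodular: marginal gains grow with |S|.
    return len(S) ** 2

print(is_monotone_submodular(coverage, COVERS))      # True
print(is_monotone_submodular(squared_size, COVERS))  # False
```

The check is exponential in the size of the ground set, so it is only useful as a sanity test on toy instances.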

This optimization problem is NP-hard. In fact, a stronger negative result holds: 

Theorem 1 ([14]). For any $\epsilon > 0$, any algorithm guaranteed to obtain a solution within a factor of
$(1 - 1/e + \epsilon)$ of $\max_{S \in \mathcal{P}} \{f(S)\}$ requires exponentially many evaluations of f in the worst case.

In light of this negative result, we can only hope to obtain a solution that achieves a fraction of
$(1 - 1/e)$ of the optimal value. In §3.2 we develop such an algorithm.



The Online Problem. The offline problem is inappropriate to model dynamic settings, where the
utility function may change over time, and we need to repeatedly select new assignments, trading
off exploration (experimenting with ad display to gain information about the utility function), and
exploitation (displaying ads which we believe will maximize utility). More formally, we face a
sequential decision problem, where, on each round (which, e.g., corresponds to a user query for a
particular term), we want to select an assignment $S_t$ (ads to display). We assume that the sets $P_1$,
$P_2, \ldots, P_K$ are fixed in advance for all rounds. After we select the assignment we obtain reward
$f_t(S_t)$ for some monotone submodular utility function $f_t$. We call the setting where we do not
get any information about $f_t$ beyond the reward the bandit feedback model. In contrast, in the
full-information feedback model we obtain oracle access to $f_t$ (i.e., we can evaluate $f_t$ on arbitrary
feasible assignments). Both models arise in real applications, as we show in §5.

The goal is to maximize the total reward we obtain, namely $\sum_t f_t(S_t)$. Following the multiarmed
bandit literature, we evaluate our performance after T rounds by comparing our total reward against
that obtained by a clairvoyant algorithm with knowledge of the sequence of functions $\langle f_1, \ldots, f_T \rangle$,
but with the restriction that it must select the same assignment on each round. The difference between
the clairvoyant algorithm's total reward and ours is called our regret. The goal is then to develop an
algorithm whose expected regret grows sublinearly in the number of rounds; such an algorithm is
said to have (or be) no-regret. However, since sums of submodular functions remain submodular, the
clairvoyant algorithm has to solve an offline assignment problem with $f(S) = \sum_t f_t(S)$. Considering
Theorem 1, no polynomial-time algorithm can possibly hope to achieve a no-regret guarantee. To
accommodate this fact, we discount the reward of the clairvoyant algorithm by a factor of $(1 - 1/e)$:
We define the $(1 - 1/e)$-regret of a random sequence $\langle S_1, \ldots, S_T \rangle$ as

$$(1 - 1/e) \cdot \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\} - E\left[ \sum_{t=1}^{T} f_t(S_t) \right].$$

Our goal is then to develop efficient algorithms whose $(1 - 1/e)$-regret grows sublinearly in T.

Footnote 1: If the same item can be placed in multiple positions, simply create multiple distinct copies of it.



Subsumed Models. Our model generalizes several common models for sponsored search ad
selection, and web search results. These include models with click-through-rates, in which it is
assumed that each (ad, position) pair has some probability $p(a, k)$ of being clicked on, and there is
some monetary reward $b(a)$ that is obtained whenever ad a is clicked on. Often, the click-through-rates
are assumed to be separable, meaning $p(a, k)$ has the functional form $\alpha(a) \cdot \beta(k)$ for some
functions $\alpha$ and $\beta$. See lEJIEl for more details on sponsored search ad allocation. Note that in both
of these cases, the (expected) reward of a set S of (ad, position) pairs is $\sum_{(a,k) \in S} g(a, k)$ for some
nonnegative function g. It is easy to verify that such a reward function is monotone submodular.
Thus, we can capture this model in our framework by setting $P_k = A \times \{k\}$, where A is the set of
ads. Another subsumed model, for web search, appears in [16]; it assumes that each user is interested
in a particular set of results, and any list of results that intersects this set generates a unit of value;
all other lists generate no value (the ordering of results is irrelevant). Again, the reward function is
monotone submodular. In this setting, it is desirable to display a diverse set of results in order to
maximize the likelihood that at least one of them will interest the user.

Our model is flexible in that we can handle position-dependent effects and diversity considerations
simultaneously. For example, we can handle the case that each user u is interested in a particular
set $A_u$ of ads and looks at a set $I_u$ of positions, and the reward of an assignment S is any
monotone-increasing concave function g of $|S \cap (A_u \times I_u)|$. If $I_u = \{1, 2, \ldots, k\}$ and $g(x) = x$, this models
the case where the quality is the number of relevant results that appear in the first k positions. If $I_u$
equals all positions and $g(x) = \min\{x, 1\}$ we recover the model of [16].

3 An approximation algorithm for the offline problem 
3.1 The locally greedy algorithm 

A simple approach to the assignment problem is the following greedy procedure: the algorithm 
steps through all K positions (according to some fixed, arbitrary ordering). For position k, it simply 
chooses the item that increases the total value as much as possible, i.e., it chooses 

$$s_k = \arg\max_{s \in P_k} \left\{ f(\{s_1, \ldots, s_{k-1}\} + s) \right\},$$

where, for a set S and element e, we write $S + e$ for $S \cup \{e\}$. Perhaps surprisingly, no matter which
ordering over the positions is chosen, this so-called locally greedy algorithm produces an assignment
that obtains at least half the optimal value [8]. In fact, the following more general result holds. We
will use this lemma in the analysis of our improved offline algorithm, which uses the locally greedy 
algorithm as a subroutine. 

Lemma 2. Suppose $f : 2^V \to \mathbb{R}_{\ge 0}$ is of the form $f(S) = f_0(S) + \sum_{k=1}^{K} f_k(S \cap P_k)$ where
$f_0 : 2^V \to \mathbb{R}_{\ge 0}$ is monotone submodular, and $f_k : 2^{P_k} \to \mathbb{R}_{\ge 0}$ is arbitrary for $k \ge 1$. Let L be the
solution returned by the locally greedy algorithm. Then $f(L) + f_0(L) \ge \max_{S \in \mathcal{P}} \{f(S)\}$.

The proof is given in Appendix A. Observe that in the special case where $f_k \equiv 0$ for all $k \ge 1$,
Lemma 2 says that $f(L) \ge \frac{1}{2} \max_{S \in \mathcal{P}} f(S)$.

The following example shows that the 1/2 approximation ratio is tight. Consider an instance of the
ad allocation problem with two ads, two positions and two users, Alice and Bob. Alice is interested
in ad 1, but has a very short attention span: She will only click on the ad if it appears in the first
position. Bob is interested in ad 2, and will look through all positions. Now suppose that Alice
searches slightly less frequently (with probability $\frac{1}{2} - \epsilon$) than Bob (who searches with probability
$\frac{1}{2} + \epsilon$). The greedy algorithm first chooses the ad to assign to slot 1. Since the ad is more likely to be
shown to Bob, the algorithm chooses ad 2, with an expected utility of $\frac{1}{2} + \epsilon$. Since Alice will only
look at position 1, no ad assigned to slot 2 can increase the expected utility further. On the other hand,
the optimal solution is to assign ad 1 to slot 1, and ad 2 to slot 2, with an expected utility of 1.
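The locally greedy rule and the Alice/Bob instance can be reproduced in a few lines. This is a sketch under stated assumptions: items are encoded as hypothetical `(ad, slot)` pairs, $\epsilon$ is fixed at 1/10, and exact rational arithmetic is used so the values are not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def locally_greedy(partitions, f):
    """Visit the positions in order; in each, pick the item with the best marginal value."""
    chosen = frozenset()
    for items in partitions:
        best = max(items, key=lambda s: f(chosen | {s}))
        chosen = chosen | {best}
    return chosen

EPS = Fraction(1, 10)                      # illustrative value of epsilon
# SLOTS[k] lists the (ad, slot) pairs that may fill slot k + 1.
SLOTS = [[(ad, slot) for ad in (1, 2)] for slot in (1, 2)]

def utility(S):
    """Expected clicks: Alice needs ad 1 in slot 1, Bob needs ad 2 anywhere."""
    alice_clicks = (1, 1) in S
    bob_clicks = any(ad == 2 for ad, _slot in S)
    return (Fraction(1, 2) - EPS) * alice_clicks + (Fraction(1, 2) + EPS) * bob_clicks

greedy = locally_greedy(SLOTS, utility)
# Brute-force optimum over all assignments with exactly one ad per slot.
optimum = max((frozenset(choice) for choice in product(*SLOTS)), key=utility)

print(utility(greedy), utility(optimum))   # 3/5 1
```

Greedy commits slot 1 to ad 2 and is stuck at $\frac{1}{2} + \epsilon$, while the optimum reaches 1; letting $\epsilon$ shrink drives the ratio to 1/2.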

3.2 An algorithm with optimal approximation ratio 

We now present an algorithm that achieves the optimal approximation ratio of $1 - 1/e$, improving on
the $\frac{1}{2}$ approximation for the locally greedy algorithm. Our algorithm associates with each partition
$P_k$ a color $c_k$ from a palette [C] of C colors, where we use the notation $[n] = \{1, 2, \ldots, n\}$. For any
set $S \subseteq V \times [C]$ and vector $\vec{c} = (c_1, \ldots, c_K)$, define

$$\mathrm{sample}_{\vec{c}}(S) := \bigcup_{k=1}^{K} \{ x \in P_k : (x, c_k) \in S \}.$$

Given a set S of (item, color) pairs, which we may think of as labeling each item with one or more
colors, $\mathrm{sample}_{\vec{c}}(S)$ returns a set containing each item x that is labeled with whatever color $\vec{c}$ assigns
to the partition that contains x. Let F(S) denote the expected value of $f(\mathrm{sample}_{\vec{c}}(S))$ when each
color $c_k$ is selected uniformly at random from [C]. We consider the following algorithm.



Algorithm: TabularGreedy

Input: integer C, sets $P_1, P_2, \ldots, P_K$, function $f : 2^V \to \mathbb{R}_{\ge 0}$ (where $V = \bigcup_{k=1}^{K} P_k$)
set $G := \emptyset$.
for c from 1 to C do                                          /* For each color */
    for k from 1 to K do                                      /* For each partition */
        set $g_{k,c} = \arg\max_{x \in P_k \times \{c\}} \{F(G + x)\}$       /* Greedily pick $g_{k,c}$ */
        set $G := G + g_{k,c}$.
for each $k \in [K]$, choose $c_k$ uniformly at random from [C].
return $\mathrm{sample}_{\vec{c}}(G)$, where $\vec{c} := (c_1, \ldots, c_K)$.
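The pseudocode above can be sketched in Python for small instances. This sketch makes assumptions that are not in the paper: F is computed exactly by enumerating all $C^K$ colorings (feasible only for toy sizes; the paper instead discusses estimating F by sampling), colors are numbered from 0, and the test utility is the Alice/Bob instance of §3.1 with $\epsilon = 1/10$ and hypothetical `(ad, slot)` item encoding.

```python
import random
from fractions import Fraction
from itertools import product

def sample(colored_items, coloring, partitions):
    """Keep each item x of partition k that carries the color coloring[k]."""
    return frozenset(
        x for k, items in enumerate(partitions) for x in items
        if (x, coloring[k]) in colored_items
    )

def expected_value(colored_items, partitions, f, num_colors):
    """F(S): exact average of f(sample_c(S)) over all colorings c."""
    colorings = list(product(range(num_colors), repeat=len(partitions)))
    total = sum(f(sample(colored_items, c, partitions)) for c in colorings)
    return total / len(colorings)

def tabular_greedy_table(partitions, f, num_colors):
    """Build the table G of (item, color) pairs, one greedy pick per cell."""
    G = frozenset()
    for color in range(num_colors):            # for each color
        for items in partitions:               # for each partition
            best = max(items, key=lambda x: expected_value(
                G | {(x, color)}, partitions, f, num_colors))
            G = G | {(best, color)}
    return G

def tabular_greedy(partitions, f, num_colors, rng):
    """Build the table, then sample one color per partition."""
    G = tabular_greedy_table(partitions, f, num_colors)
    coloring = [rng.randrange(num_colors) for _ in partitions]
    return sample(G, coloring, partitions)

# Alice/Bob instance: Alice (prob. 2/5) needs ad 1 in slot 1, Bob (prob. 3/5) needs ad 2.
SLOTS = [[(ad, slot) for ad in (1, 2)] for slot in (1, 2)]

def utility(S):
    alice = Fraction(2, 5) * ((1, 1) in S)
    bob = Fraction(3, 5) * any(ad == 2 for ad, _slot in S)
    return alice + bob

for C in (1, 2):
    table = tabular_greedy_table(SLOTS, utility, C)
    print(C, expected_value(table, SLOTS, utility, C))   # 1 3/5, then 2 4/5
```

On this instance a single color reproduces the locally greedy value 3/5, while two colors already raise the expected value F(G) to 4/5 (the optimum is 1).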









Observe that when C = 1, the $\mathrm{sample}_{\vec{c}}$ function is deterministic and TabularGreedy is simply
the locally greedy algorithm from §3.1. In the limit as $C \to \infty$, TabularGreedy can
intuitively be viewed as an algorithm for a continuous extension of the problem followed by a
rounding procedure, in the same spirit as Vondrák's continuous-greedy algorithm [4]. In our case, the
continuous extension is to compute a probability distribution $D_k$ for each position k with support in
$P_k$ (plus a special "select nothing" outcome), such that if we independently sample an element $x_k$
from $D_k$, $E[f(\{x_1, \ldots, x_K\})]$ is maximized. It turns out that if the positions individually, greedily,
and in round-robin fashion, add infinitesimal units of probability mass to their distributions so as to
maximize this objective function, they achieve the same objective function value as if, rather than
making decisions in a round-robin fashion, they had cooperated and added the combination of K
infinitesimal probability mass units (one per position) that greedily maximizes the objective function.
The latter process, in turn, can be shown to be equivalent to a greedy algorithm for maximizing a
(different) submodular function subject to a cardinality constraint, which implies that it achieves
a $1 - 1/e$ approximation ratio [15]. TabularGreedy represents a tradeoff between these two
extremes; its performance is summarized by the following theorem. For now, we assume that the
arg max in the inner loop is computed exactly. In Appendix A we bound the performance loss that
results from approximating the arg max (e.g., by estimating F by repeated sampling).

Theorem 3. Suppose f is monotone submodular. Then $F(G) \ge \beta(K, C) \cdot \max_{S \in \mathcal{P}} \{f(S)\}$, where
$\beta(K, C)$ is defined as $1 - (1 - \frac{1}{C})^C - \binom{K}{2} C^{-1}$.
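To get a feel for the guarantee, the ratio can be evaluated numerically. The snippet below is only an illustration of the formula, assuming the form $\beta(K, C) = 1 - (1 - 1/C)^C - \binom{K}{2}/C$ given in the theorem; the function name `beta` is ours.

```python
from math import comb, e

def beta(K, C):
    """Approximation ratio of Theorem 3 for K positions and C colors."""
    return 1 - (1 - 1 / C) ** C - comb(K, 2) / C

for C in (4, 100, 10_000, 1_000_000):
    print(C, round(beta(6, C), 4))
print("1 - 1/e =", round(1 - 1 / e, 4))
```

For K = 6 the bound is negative (vacuous) at C = 4, because the $\binom{K}{2}/C$ term dominates; it only approaches $1 - 1/e$ once C is large compared with $K^2$. This is a statement about the worst-case bound, not about the algorithm's behavior with few colors.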

It follows that, for any $\epsilon > 0$, TabularGreedy achieves a $(1 - 1/e - \epsilon)$ approximation factor
using a number of colors that is polynomial in K and $1/\epsilon$. The theorem will follow immediately
from the combination of two key lemmas, which we now prove. Informally, Lemma 4 analyzes the
approximation error due to the outer greedy loop of the algorithm, while Lemma 5 analyzes the
approximation error due to the inner loop.
Lemma 4. Let $G_c = \{g_{1,c}, g_{2,c}, \ldots, g_{K,c}\}$, and let $G_c^- = G_1 \cup G_2 \cup \ldots \cup G_{c-1}$. For each
color c, choose $E_c \in \mathbb{R}$ such that $F(G_c^- \cup G_c) \ge \max_{R \in \mathcal{R}_c} \{F(G_c^- \cup R)\} - E_c$ where
$\mathcal{R}_c := \{R : \forall k \in [K], |R \cap (P_k \times \{c\})| = 1\}$ is the set of all possible choices for $G_c$. Then

$$F(G) \ge \beta(C) \cdot \max_{S \in \mathcal{P}} \{f(S)\} - \sum_{c=1}^{C} E_c, \qquad (3.1)$$

where $\beta(C) = 1 - (1 - \frac{1}{C})^C$.

Proof (Sketch). We will refer to an element R of $\mathcal{R}_c$ as a row, and to c as the color of the row. Let
$\mathcal{R}_{[C]} := \bigcup_{c=1}^{C} \mathcal{R}_c$ be the set of all rows. Consider the function $H : 2^{\mathcal{R}_{[C]}} \to \mathbb{R}_{\ge 0}$, defined as
$H(\mathcal{R}) = F(\bigcup_{R \in \mathcal{R}} R)$. We will prove the lemma in three steps: (i) H is monotone submodular, (ii)
TabularGreedy is simply the locally greedy algorithm for finding a set of C rows that maximizes
H, where the c-th greedy step is performed with additive error $E_c$, and (iii) TabularGreedy obtains
the guarantee (3.1) for maximizing H, and this implies the same ratio for maximizing F.

To show that H is monotone submodular, it suffices to show that F is monotone submodular.
Because $F(S) = E_{\vec{c}}[f(\mathrm{sample}_{\vec{c}}(S))]$, and because a convex combination of monotone submodular
functions is monotone submodular, it suffices to show that for any particular coloring $\vec{c}$, the function
$f(\mathrm{sample}_{\vec{c}}(S))$ is monotone submodular. This follows from the definition of sample and the fact
that f is monotone submodular.

The second claim is true by inspection. To prove the third claim, we use the fact that, over all rows
$R \in \mathcal{R}_{[C]}$, $F(G_c^- \cup R)$ can always be maximized by choosing a row $R \in \mathcal{R}_c$. Informally, this is
because $F(G_c^- \cup R)$ can always be maximized by choosing a row whose color has not already been
used, and all colors $\ge c$ are interchangeable. For problems with this special property, it is known that
the locally greedy algorithm obtains an approximation ratio of $\beta(C) = 1 - (1 - \frac{1}{C})^C$ [15]. Theorem 6
of [17] extends this result to handle additive error, and yields

$$F(G) = H(\{G_1, G_2, \ldots, G_C\}) \ge \beta(C) \cdot \max_{\mathcal{R} \subseteq \mathcal{R}_{[C]} : |\mathcal{R}| \le C} \{H(\mathcal{R})\} - \sum_{c=1}^{C} E_c.$$

To complete the proof, it suffices to show that $\max_{\mathcal{R} \subseteq \mathcal{R}_{[C]} : |\mathcal{R}| \le C} \{H(\mathcal{R})\} \ge \max_{S \in \mathcal{P}} \{f(S)\}$. This
follows from the fact that for any assignment $S \in \mathcal{P}$, we can find a set $\mathcal{R}(S)$ of C rows such that
$\mathrm{sample}_{\vec{c}}(\bigcup_{R \in \mathcal{R}(S)} R) = S$ with probability 1, and therefore $H(\mathcal{R}(S)) = f(S)$. □

We now bound the performance of the inner loop of TabularGreedy.

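Lemma 5. Let $f^* = \max_{S \in \mathcal{P}} \{f(S)\}$, and let $G_c$, $G_c^-$, and $\mathcal{R}_c$ be defined as in the statement of
Lemma 4. Then, for any $c \in [C]$,

$$F(G_c^- \cup G_c) \ge \max_{R \in \mathcal{R}_c} \{F(G_c^- \cup R)\} - \binom{K}{2} C^{-2} f^*.$$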

Proof (Sketch). Let N denote the number of partitions whose color (under $\vec{c}$) is c. For $R \in \mathcal{R}_c$, let
$\Delta_{\vec{c}}(R) := f(\mathrm{sample}_{\vec{c}}(G_c^- \cup R)) - f(\mathrm{sample}_{\vec{c}}(G_c^-))$, and let $F_c(R) := F(G_c^- \cup R) - F(G_c^-)$. By definition,
$F_c(R) = E_{\vec{c}}[\Delta_{\vec{c}}(R)] = P[N = 1] \, E_{\vec{c}}[\Delta_{\vec{c}}(R) \mid N = 1] + P[N \ge 2] \, E_{\vec{c}}[\Delta_{\vec{c}}(R) \mid N \ge 2]$, where we
have used the fact that $\Delta_{\vec{c}}(R) = 0$ when N = 0. The idea of the proof is that the first of these terms
dominates as $C \to \infty$, and that $E_{\vec{c}}[\Delta_{\vec{c}}(R) \mid N = 1]$ can be optimized exactly simply by optimizing
each element of $P_k \times \{c\}$ independently. Specifically, it can be seen that $E_{\vec{c}}[\Delta_{\vec{c}}(R) \mid N = 1] =
\sum_{k=1}^{K} f_k(R \cap (P_k \times \{c\}))$ for suitable $f_k$. Additionally, $f_0(R) = P[N \ge 2] \, E_{\vec{c}}[\Delta_{\vec{c}}(R) \mid N \ge 2]$ is
a monotone submodular function of a set of (item, color) pairs, for the same reasons F is. Applying
Lemma 2 with these $\{f_k : k \ge 0\}$ yields

$$F_c(G_c) + P[N \ge 2] \, E_{\vec{c}}[\Delta_{\vec{c}}(G_c) \mid N \ge 2] \ge \max_{R \in \mathcal{R}_c} \{F_c(R)\}.$$

To complete the proof, it suffices to show that $P[N \ge 2] \le \binom{K}{2} C^{-2}$ and that $E_{\vec{c}}[\Delta_{\vec{c}}(G_c) \mid N \ge 2] \le
f^*$. The first inequality holds because, if we let M be the number of pairs of partitions that are both
assigned color c, we have $P[N \ge 2] = P[M \ge 1] \le E[M] = \binom{K}{2} C^{-2}$. The second inequality
follows from the fact that for any $\vec{c}$ we have $\Delta_{\vec{c}}(G_c) \le f(\mathrm{sample}_{\vec{c}}(G_c^- \cup G_c)) \le f^*$. □
4 An algorithm for online learning of assignments



We now transform the offline algorithm of §3.2 into an online algorithm. The high-level idea behind
this transformation is to replace each greedy decision made by the offline algorithm with a no-regret
online algorithm. A similar approach was used in [16] and [18] to obtain an online algorithm for
different (simpler) online problems.



Algorithm: TGbandit (described in the full-information feedback model)
Input: sets $P_1, P_2, \ldots, P_K$
for each $k \in [K]$ and $c \in [C]$, let $\mathcal{E}_{k,c}$ be a no-regret algorithm with action set $P_k \times \{c\}$.
for t from 1 to T do
    for each $k \in [K]$ and $c \in [C]$, let $g^t_{k,c} \in P_k \times \{c\}$ be the action selected by $\mathcal{E}_{k,c}$
    for each $k \in [K]$, choose $c_k$ uniformly at random from [C]. Define $\vec{c} = (c_1, \ldots, c_K)$.
    select the set $G_t = \mathrm{sample}_{\vec{c}}(\{g^t_{k,c} : k \in [K], c \in [C]\})$
    observe $f_t$, and let $\bar{F}_t(S) := f_t(\mathrm{sample}_{\vec{c}}(S))$
    for each $k \in [K]$, $c \in [C]$ do
        define $G^{t-}_{k,c} := \{g^t_{k',c'} : k' \in [K], c' < c\} \cup \{g^t_{k',c} : k' < k\}$
        for each $x \in P_k \times \{c\}$, feed back $\bar{F}_t(G^{t-}_{k,c} + x)$ to $\mathcal{E}_{k,c}$ as the reward for choosing x
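A compact full-information sketch of this loop is given below. It is an illustration under assumptions, not the authors' implementation: the no-regret subroutine is a plain exponential-weights learner (the `Hedge` class, with an arbitrary learning rate `eta`), rewards are assumed to lie in [0, 1], colors are numbered from 0, and the utility is the Alice/Bob example of §3.1 with $\epsilon = 0.1$.

```python
import math
import random

class Hedge:
    """Exponential-weights learner over a fixed action list (full information)."""
    def __init__(self, actions, eta, rng):
        self.actions = list(actions)
        self.eta = eta
        self.rng = rng
        self.weights = [1.0] * len(self.actions)

    def select(self):
        return self.rng.choices(self.actions, weights=self.weights)[0]

    def update(self, rewards):
        # rewards[i] is the payoff action i would have earned this round.
        self.weights = [w * math.exp(self.eta * r)
                        for w, r in zip(self.weights, rewards)]
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]   # rescale to avoid overflow

def sample(colored_items, coloring, partitions):
    """Keep each item x of partition k that carries the color coloring[k]."""
    return frozenset(x for k, items in enumerate(partitions) for x in items
                     if (x, coloring[k]) in colored_items)

def tg_full_information(partitions, utilities, num_colors, eta=0.5, seed=0):
    """Run the loop over the given sequence of utilities; return the total reward."""
    rng = random.Random(seed)
    K = len(partitions)
    experts = {(k, c): Hedge(partitions[k], eta, rng)
               for k in range(K) for c in range(num_colors)}
    total = 0.0
    for f_t in utilities:
        picks = {kc: alg.select() for kc, alg in experts.items()}
        coloring = [rng.randrange(num_colors) for _ in range(K)]
        table = frozenset((picks[k, c], c)
                          for k in range(K) for c in range(num_colors))
        total += f_t(sample(table, coloring, partitions))
        prefix = set()                  # plays the role of G^{t-}_{k,c}
        for c in range(num_colors):
            for k in range(K):
                rewards = [f_t(sample(prefix | {(x, c)}, coloring, partitions))
                           for x in partitions[k]]
                experts[k, c].update(rewards)
                prefix.add((picks[k, c], c))
    return total

SLOTS = [[(ad, slot) for ad in (1, 2)] for slot in (1, 2)]

def utility(S):
    alice = 0.4 if (1, 1) in S else 0.0
    bob = 0.6 if any(ad == 2 for ad, _slot in S) else 0.0
    return alice + bob

print(tg_full_information(SLOTS, [utility] * 200, num_colors=2))
```

Feeding back the value $\bar{F}_t(G^{t-}_{k,c} + x)$ rather than the marginal gain shifts every action's reward by the same constant, which does not change the learner's distribution.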



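The following theorem summarizes the performance of TGbandit.
Theorem 6. Let $r_{k,c}$ be the regret of $\mathcal{E}_{k,c}$, and let $\beta(K, C) = 1 - (1 - \frac{1}{C})^C - \binom{K}{2} C^{-1}$. Then

$$E\left[ \sum_{t=1}^{T} f_t(G_t) \right] \ge \beta(K, C) \cdot \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\} - E\left[ \sum_{k=1}^{K} \sum_{c=1}^{C} r_{k,c} \right].$$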



Observe that Theorem 6 is similar to Theorem 3 with the addition of the $E[r_{k,c}]$ terms. The idea
of the proof is to view TGbandit as a version of TabularGreedy that, instead of greedily
selecting single (element, color) pairs $g_{k,c} \in P_k \times \{c\}$, greedily selects (element vector, color) pairs
$\vec{g}_{k,c} \in P_k^T \times \{c\}$ (this greedy decision is made with additive error $r_{k,c}$, which is the source of
the extra terms). Once this correspondence is established, the theorem follows along the lines of
Theorem 3. For a proof, see Appendix A.

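Corollary 7. If TGbandit is run with randomized weighted majority [5] as the subroutine, then

$$E\left[ \sum_{t=1}^{T} f_t(G_t) \right] \ge \beta(K, C) \cdot \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\} - O\left( C \sum_{k=1}^{K} \sqrt{T \log |P_k|} \right),$$

where $\beta(K, C) = 1 - (1 - \frac{1}{C})^C - \binom{K}{2} C^{-1}$.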



Optimizing for C in Corollary 7 yields $(1 - \frac{1}{e})$-regret $\tilde{\Theta}(K^{3/2} T^{1/4} \sqrt{OPT})$ ignoring logarithmic
factors, where $OPT := \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\}$ is the value of the static optimum.
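The balancing step behind this rate can be sketched as follows. This is our own back-of-the-envelope derivation from the bound in Corollary 7, using $(1 - 1/C)^C \le 1/e$ and the shorthand $n = \max_k |P_k|$ (a symbol introduced here); constants are not tracked.

```latex
% (1 - 1/e)-regret R(C), bounded via Corollary 7 and (1 - 1/C)^C <= 1/e:
\[
  R(C) \;\le\; \binom{K}{2}\,\frac{\mathrm{OPT}}{C}
        \;+\; O\!\left( C\,K \sqrt{T \log n} \right).
\]
% The first term decreases in C and the second increases; equating them gives
\[
  C^{\ast} \approx \sqrt{\frac{K\,\mathrm{OPT}}{\sqrt{T \log n}}}
  \quad\Longrightarrow\quad
  R(C^{\ast}) = O\!\left( K^{3/2}\, T^{1/4}\, (\log n)^{1/4} \sqrt{\mathrm{OPT}} \right).
\]
```

Since OPT grows at most linearly in T when rewards are bounded, the resulting regret is sublinear in T.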



Dealing with bandit feedback. TGbandit can be modified to work in the bandit feedback model.
The idea behind this modification is that on each round we "explore" with some small probability, in
such a way that on each round we obtain an unbiased estimate of the desired feedback values
$\bar{F}_t(G^{t-}_{k,c} + x)$ for each $k \in [K]$, $c \in [C]$, and $x \in P_k$. This technique can be used to achieve a bound similar to
the one stated in Corollary 7, but with an additive regret term of $O\left( (T |V| C K)^{2/3} (\log |V|)^{1/3} \right)$.

Stronger notions of regret. By substituting in different algorithms for the subroutines $\mathcal{E}_{k,c}$, we can
obtain additional guarantees. For example, Blum and Mansour [3] consider online problems in which
we are given time-selection functions $I_1, I_2, \ldots, I_M$. Each time-selection function $I : [T] \to [0, 1]$
associates a weight with each round, and defines a corresponding weighted notion of regret in the
natural way. Blum and Mansour's algorithm guarantees low weighted regret with respect to all
M time-selection functions simultaneously. This can be used to obtain low regret with respect to
different (possibly overlapping) windows of time simultaneously, or to obtain low regret with respect
to subsets of rounds that have particular features. By using their algorithm as a subroutine within
TGbandit, we get similar guarantees, both in the full-information and bandit feedback models.

5 Evaluation 

We evaluate TGbandit experimentally on two applications: learning to rank blogs that are effective
in detecting cascades of information, and allocating advertisements to maximize revenue.

5.1 Online learning of diverse blog rankings 

We consider the problem of ranking a set of blogs and news sources on the web. Our approach is 
based on the following idea: A blogger writes a posting, and, after some time, other postings link to 
it, forming cascades of information propagating through the network of blogs. 

More formally, an information cascade is a directed acyclic graph of vertices (each vertex corresponds 
to a posting at some blog), where edges are annotated by the time difference between the postings. 
Based on this notion of an information cascade, we would like to select blogs that detect big 
cascades (containing many nodes) as early as possible (i.e., we want to learn about an important 
event before most other readers). In [13] it is shown how one can formalize this notion of utility
using a monotone submodular function that measures the informativeness of a subset of blogs. 
Optimizing the submodular function yields a small set of blogs that "covers" most cascades. This 
utility function prefers diverse sets of blogs, minimizing the overlap of the detected cascades, and 
therefore minimizing redundancy. 

The work of [13] has two major shortcomings. First, it selects a set of blogs rather than a
ranking, even though a ranking is of practical importance for presentation on a web service. Second,
it does not address the problem of sequential prediction, where the set of blogs must be updated
dynamically over time. In this paper, we address these shortcomings.

Results on offline blog ranking. In order to model the blog ranking problem, we adopt the
assumption that different users have different attention spans: Each user will only consider blogs
appearing in a particular subset of positions. In our experiments, we assume that the probability that
a user is willing to look at position k is proportional to $\gamma^k$, for some discount factor $0 < \gamma < 1$.
More formally, let g be the monotone submodular function measuring the informativeness of any set
of blogs, defined as in [13]. Let $P_k = B \times \{k\}$, where B is the set of blogs. Given an assignment
$S \in \mathcal{P}$, let $S^{[k]} = S \cap \{P_1 \cup P_2 \cup \ldots \cup P_k\}$ be the assignment of blogs to positions 1 through k.
We define the discounted value of the assignment S as $f(S) = \sum_{k=1}^{K} \gamma^k \left( g(S^{[k]}) - g(S^{[k-1]}) \right)$. It
can be seen that $f : 2^V \to \mathbb{R}_{\ge 0}$ is monotone submodular.
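The discounted objective is easy to compute for a complete ranking. The sketch below assumes every position is filled, represents the ranking as a list of blog names (position k holds `ranking[k - 1]`), and uses a hypothetical coverage function `g` over made-up cascade ids in place of the informativeness measure of [13].

```python
def discounted_value(ranking, g, gamma):
    """f(S) = sum_k gamma^k * (g(S^[k]) - g(S^[k-1])) for a fully ranked list."""
    total = 0.0
    prefix = frozenset()
    previous = g(prefix)
    for k, blog in enumerate(ranking, start=1):
        prefix = prefix | {blog}
        current = g(prefix)
        total += gamma ** k * (current - previous)   # discounted marginal gain
        previous = current
    return total

# Hypothetical data: which cascades each blog detects.
CASCADES = {"b1": {1, 2, 3}, "b2": {2, 3}, "b3": {4}}

def g(blogs):
    """Number of distinct cascades detected by a set of blogs."""
    detected = set()
    for blog in blogs:
        detected |= CASCADES[blog]
    return len(detected)

print(discounted_value(["b1", "b2", "b3"], g, 0.5))  # 1.625
print(discounted_value(["b1", "b3", "b2"], g, 0.5))  # 1.75
```

Moving the redundant blog `b2` below `b3` raises the value, which is the diversity effect the objective is meant to reward.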

For our experiments, we use the data set of [13], consisting of 45,192 blogs, 16,551 cascades, and
2 million postings collected during 12 months of 2006. We use the population affected objective
of [13], and use a discount factor of $\gamma = 0.8$. Based on this data, we run our TabularGreedy
algorithm with varying numbers of colors C on the blog data set. Fig. 1(a) presents the results of this
experiment. For each value of C, we generate 200 rankings, and report both the average performance
and the maximum performance over the 200 trials. Increasing C leads to an improved performance
over the locally greedy algorithm (C = 1).

Results on online learning of blog rankings. We now consider the online problem where on
each round t we want to output a ranking $S_t$. After we select the ranking, a new set of cascades
occurs, modeled using a separate submodular function $f_t$, and we obtain a reward of $f_t(S_t)$. In
our experiments, we choose one assignment per day, and define $f_t$ as the utility associated with the
cascades occurring on that day. Note that $f_t$ allows us to evaluate the performance of any possible
ranking $S_t$, hence we can apply TGbandit in the full-information feedback model.

We compare the performance of our online algorithm using C = 1 and C = 4. Fig. 1(b) presents the
average cumulative reward gained over time by both algorithms. We normalize the average reward by
the utility achieved by the TabularGreedy algorithm (with C = 1) applied to the entire data set.
[Figure 1 panels: (a) Blogs: Offline results; (c) Ad display: Online results]

Figure 1: (a, b) Results for discounted blog ranking ($\gamma = 0.8$), in the offline (a) and online (b) settings. (c)
Performance of TGbandit with C = 1, 2, and 4 colors for the sponsored search ad selection problem (each
round is a query). Note that C = 1 corresponds to the online algorithm of [16, 18].



Fig. 1(b) shows that the performance of both algorithms rapidly (within the first 47 rounds) converges
to the performance of the offline algorithm. The TGbandit algorithm with C = 4 levels out at an
approximately 4% higher reward than the algorithm with C = 1.



5.2 Online ad display 

We evaluate TGbandit for the sponsored search ad selection problem in a simple Markovian model
incorporating the value of diverse results and complex position-dependence among clicks. In this
model, each user u is defined by two sets of probabilities: $p_{click}(a)$ for each ad $a \in A$, and $p_{abandon}(k)$
for each position $k \in [K]$. When presented an assignment of ads $\{a_1, a_2, \ldots, a_K\}$, where $a_k$
occupies position k, the user scans the positions in increasing order. For each position k, the user
clicks on $a_k$ with probability $p_{click}(a_k)$, leaving the results page forever. Otherwise, with probability
$(1 - p_{click}(a_k)) \cdot p_{abandon}(k)$, the user loses interest and abandons the results without clicking on
anything. Finally, with probability $(1 - p_{click}(a_k)) \cdot (1 - p_{abandon}(k))$, the user proceeds to look at
position k + 1. The reward function $f_t$ is the number of clicks, which is either zero or one. We only
receive information about $f_t(S_t)$ (i.e., bandit feedback).
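The expected reward under this user model follows directly from the scan process. The sketch below computes the click probability for one user type; the dictionary and list encodings of $p_{click}$ and $p_{abandon}$ and the numeric values are illustrative assumptions.

```python
def click_probability(ranking, p_click, p_abandon):
    """Probability that a user scanning top-down clicks on some ad in `ranking`."""
    reach = 1.0      # probability the user is still scanning at this position
    clicked = 0.0
    for k, ad in enumerate(ranking):
        clicked += reach * p_click[ad]
        # Continue only if there was no click and the user did not abandon.
        reach *= (1 - p_click[ad]) * (1 - p_abandon[k])
    return clicked

# Illustrative values for a single user type.
p_click = {"a": 0.5, "b": 0.25}
p_abandon = [0.5, 0.5]

print(click_probability(["a", "b"], p_click, p_abandon))  # 0.5625
print(click_probability(["b", "a"], p_click, p_abandon))  # 0.4375
```

With several user types, the expected reward of an assignment is the average of this quantity over the types, weighted by how often each type appears.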

In our evaluation, there are two types of users: those interested in all positions ($p_{abandon} = 0$), and
those that quickly lose interest ($p_{abandon} = 0.5$). For both types of users we select $p_{click}(a)$ uniformly
at random from [0, 1], independently for each ad a (once chosen, $p_{click}$ is fixed for all rounds). We
use K = 6 positions, and 6 available ads. In Fig. 1(c) we compare the performance of TGbandit
with C = 4 to the online algorithm of [16, 18], based on the average of 100 experiments. The latter
algorithm is equivalent to running TGbandit with C = 1. The former achieves parity with the latter
after roughly 10^ rounds, and dominates thereafter. 

It can be shown that with several different types of users with distinct $p_{click}(\cdot)$ functions the offline
problem of finding an assignment within $1 - \frac{1}{e} + \epsilon$ of optimal is NP-hard. This is in contrast to the
case in which $p_{click}$ and $p_{abandon}$ are the same for all users; in this case the offline problem simply
requires finding an optimal policy for a Markov decision process, which can be done efficiently using
well-known algorithms. A slightly different Markov model of user behavior which is efficiently
solvable was considered in [1]. In that model, $p_{click}$ and $p_{abandon}$ are the same for all users, and $p_{abandon}$
is a function of the ad in the slot currently being scanned rather than its index.

6 Related Work 

For a general introduction to the literature on submodular function maximization, see [19]. For
applications of submodularity to machine learning and AI see [10, 11].

Our offline problem is known as maximizing a monotone submodular function subject to a (simple)
partition matroid constraint in the operations research and theoretical computer science communities.
The study of this problem culminated in the elegant (1 - 1/e) approximation algorithm of Vondrák [20]
and a matching unconditional lower bound of Mirrokni et al. [14]. Vondrák's algorithm, called
the continuous-greedy algorithm, has also been extended to handle arbitrary matroid constraints [4].
The continuous-greedy algorithm, however, cannot be applied to our problem directly, because it
requires the ability to sample $f(\cdot)$ on infeasible sets $S \notin \mathcal{P}$. In our context, this means it must have
the ability to ask (for example) what the revenue will be if ads $a_1$ and $a_2$ are placed in position #1
simultaneously. We do not know how to answer such questions in a way that leads to meaningful
performance guarantees.

In the online setting, the most closely related work is that of Streeter and Golovin ifTSl . Like us, they 
consider sequences of monotone submodular reward functions that arrive online, and develop an 
online algorithm that uses multi-armed bandit algorithms as subroutines. The key difference from our 
work is that, as in [16], they are concerned with selecting a set of K items rather than the more general 
problem of selecting an assignment of items to positions addressed in this paper. Kakade et al. [9] 
considered the general problem of using α-approximation algorithms to construct no α-regret online 
algorithms, and essentially proved it could be done for the class of linear optimization problems in 
which the cost function has the form c(S, w) for a solution S and weight vector w, and c(S, w) is 
linear in w. However, their result is orthogonal to ours, because our objective function is submodular 
and not linear.³ 

7 Conclusions 

In this paper, we showed that important problems, such as ad display in sponsored search and 
computing diverse rankings of information sources on the web, require optimizing assignments 
under submodular utility functions. We developed an efficient algorithm, TabularGreedy, which 
obtains the optimal approximation ratio of (1 − 1/e) for this NP-hard optimization problem. We 
also developed an online algorithm, TGbandit, that asymptotically achieves no (1 − 1/e)-regret 
for the problem of repeatedly selecting informative assignments, under the full-information and 
bandit-feedback settings. Finally, we demonstrated that our algorithm outperforms previous work on 
two real-world problems, namely online ranking of informative blogs and ad allocation. 

Appendix A: Proofs 

Lemma 2 is a corollary of the following more general lemma. The difference between the two lemmas 
is that, unlike Lemma 2, Lemma 8 allows for the possibility that each argmax in the locally greedy 
algorithm is evaluated with additive error. We will need this result in analyzing TGbandit later on. 

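Lemma 8. Let $f : 2^V \to \mathbb{R}_{\ge 0}$ be a function of the form $f(S) = f_0(S) + \sum_{k=1}^{K} f_k(S \cap P_k)$, 
where $f_0 : 2^V \to \mathbb{R}_{\ge 0}$ is monotone submodular, and $f_k : 2^{P_k} \to \mathbb{R}_{\ge 0}$ is arbitrary for $k \ge 1$. Let 
$L = \{\ell_1, \ell_2, \ldots, \ell_K\}$, where $\ell_k \in P_k$ (for $1 \le k \le K$). Suppose that for any $k$, 

\[ f(\{\ell_1, \ell_2, \ldots, \ell_k\}) \ge \max_{x \in P_k} \left\{ f(\{\ell_1, \ell_2, \ldots, \ell_{k-1}\} + x) \right\} - \varepsilon_k . \tag{7.1} \]

Then 

\[ f(L) + f_0(L) \ge \max_{S \in \mathcal{P}} \{ f(S) \} - \sum_{k=1}^{K} \varepsilon_k . \]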

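Proof. Let $\mathrm{OPT} = \arg\max_{S \in \mathcal{P}} \{f(S)\}$, and let $\mathrm{OPT} = \{o_1, o_2, \ldots, o_K\}$, where $o_k \in P_k$ (for 
$1 \le k \le K$). Define 

\[ \Delta_k(L) := f_k(\{o_k\}) + f_0(L \cup \{o_1, o_2, \ldots, o_k\}) - f_0(L \cup \{o_1, o_2, \ldots, o_{k-1}\}) . \]

Let $L_k = \{\ell_1, \ell_2, \ldots, \ell_{k-1}\}$. Using submodularity of $f_0$, we have 

\begin{align*}
\Delta_k(L) &\le f_k(\{o_k\}) + f_0(L_k + o_k) - f_0(L_k) \\
            &= f(L_k + o_k) - f(L_k) \\
            &\le f(L_k + \ell_k) - f(L_k) + \varepsilon_k .
\end{align*}

Then, using monotonicity of $f_0$, 

\begin{align*}
f(\mathrm{OPT}) &\le f_0(L \cup \mathrm{OPT}) + \sum_{k=1}^{K} f_k(\mathrm{OPT} \cap P_k) \\
                &= f_0(L) + \sum_{k=1}^{K} \Delta_k(L) \\
                &\le f_0(L) + \sum_{k=1}^{K} \left[ f(L_k + \ell_k) - f(L_k) + \varepsilon_k \right] \\
                &= f_0(L) + f(L) - f(\emptyset) + \sum_{k=1}^{K} \varepsilon_k .
\end{align*}

Rearranging this inequality and using $f(\emptyset) \ge 0$ completes the proof. □ 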
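
As an illustration only, the following Python sketch checks the special case of Lemma 8 with all $f_k$ identically zero and exact argmaxes ($\varepsilon_k = 0$), where the bound reads $2 f(L) \ge \max_S f(S)$. The coverage instance, the item names, and the helper functions are hypothetical and chosen so that the locally greedy choice is visibly suboptimal; none of them come from the paper.

```python
from itertools import product

# Hypothetical toy instance: each item covers a set of topics.
COVERS = {
    "a1": {1, 2, 3}, "a2": {4, 5},
    "b1": {1, 2, 3}, "b2": {6},
}
# Partition P_1, P_2: exactly one item is chosen from each block.
PARTITION = [["a1", "a2"], ["b1", "b2"]]


def f(items):
    """Monotone submodular coverage function: number of distinct topics covered."""
    covered = set()
    for item in items:
        covered |= COVERS[item]
    return len(covered)


def locally_greedy(partition, objective):
    """Go through the blocks in order, adding the item that maximizes the objective."""
    chosen = []
    for block in partition:
        best = max(block, key=lambda x: objective(chosen + [x]))
        chosen.append(best)
    return chosen


def brute_force_opt(partition, objective):
    """Exhaustive maximum over all assignments (exponential; tiny instances only)."""
    return max(objective(list(s)) for s in product(*partition))


L = locally_greedy(PARTITION, f)
opt = brute_force_opt(PARTITION, f)
# Special case of Lemma 8 (f_k = 0, eps_k = 0): f(L) + f_0(L) = 2 f(L) >= OPT.
assert 2 * f(L) >= opt
print(L, f(L), opt)
```

On this instance the greedy assignment is `["a1", "b2"]` with value 4, while the optimum `["a2", "b1"]` has value 5, so the factor-2 guarantee holds with room to spare.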

To analyze TGbandit, we will also need Theorem 9, which is a generalization of Theorem 3. 

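Theorem 9. Suppose $f$ is monotone submodular. Let $G = \{g_{k,c} : k \in [K],\ c \in [C]\}$, where $g_{k,c} \in P_k \times \{c\}$ 
for all $k$ and $c$. Suppose that for all $k \in [K]$ and $c \in [C]$, 

\[ F(G^-_{k,c} + g_{k,c}) \ge \max_{x \in P_k \times \{c\}} \left\{ F(G^-_{k,c} + x) \right\} - \varepsilon_{k,c} \tag{7.2} \]

where $G^-_{k,c} = \{g_{k',c'} : k' \in [K],\ c' < c\} \cup \{g_{k',c} : k' < k\}$ (i.e., $G^-_{k,c}$ equals $G$ just before $g_{k,c}$ is 
added). Then 

\[ F(G) \ge \beta(K, C) \cdot \max_{S \in \mathcal{P}} \{ f(S) \} - \sum_{k=1}^{K} \sum_{c=1}^{C} \varepsilon_{k,c} , \]

where $\beta(K, C)$ is defined as $1 - \left(1 - \frac{1}{C}\right)^{C} - \binom{K}{2} C^{-1}$. 

Proof. The proof is identical to the proof of Theorem 3 in the main text, using Lemma 8 in place of 
Lemma 2. □ 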
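
To give a feel for the multiplicative factor in Theorem 9, the following Python sketch evaluates it numerically, taking it to be $\beta(K, C) = 1 - (1 - 1/C)^C - \binom{K}{2}/C$. The function name `beta` and the chosen values of $K$ and $C$ are illustrative assumptions.

```python
from math import comb, e


def beta(K: int, C: int) -> float:
    """Evaluate 1 - (1 - 1/C)^C - binom(K, 2) / C for K positions and C colors."""
    return 1.0 - (1.0 - 1.0 / C) ** C - comb(K, 2) / C


if __name__ == "__main__":
    K = 5
    for C in (1, 10, 100, 10_000, 1_000_000):
        # The penalty term binom(K, 2) / C dominates for small C and vanishes as C grows.
        print(f"C = {C:>9}: beta = {beta(K, C):.6f}")
    print(f"1 - 1/e       = {1 - 1 / e:.6f}")
```

For fixed $K$, the guarantee is vacuous (negative) when $C$ is small and approaches $1 - 1/e$ as $C$ grows, which matches the trade-off between the number of colors and the approximation ratio.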

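Finally, we prove Theorem 6, which we restate here for convenience. 

Theorem 6. Let $r_{k,c}$ be the regret of $\mathcal{E}_{k,c}$, and let $\beta(K, C) = 1 - \left(1 - \frac{1}{C}\right)^{C} - \binom{K}{2} C^{-1}$. Then 

\[ \mathbb{E}\left[ \sum_{t=1}^{T} f_t(G_t) \right] \ge \beta(K, C) \cdot \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\} - \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{c=1}^{C} r_{k,c} \right] . \]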


Proof. The idea of the proof is to view TGbandit as a version of TabularGreedy that, instead 
of greedily selecting single (element, color) pairs $g_{k,c} \in P_k \times \{c\}$, greedily selects (element vector, 
color) pairs $\vec{g}_{k,c} \in P_k^T \times \{c\}$, where $T$ is the number of rounds. 

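First note that for any $k \in [K]$, $c \in [C]$, and $x \in P_k \times \{c\}$, by definition of $r_{k,c}$ we have 

\[ \sum_{t=1}^{T} F_t(G^{t-}_{k,c} + g^{t}_{k,c}) \ge \left( \sum_{t=1}^{T} F_t(G^{t-}_{k,c} + x) \right) - r_{k,c} . \]

Taking the expectation of both sides over $\vec{c}$, and choosing $x$ to maximize the right-hand side, we get 

\[ \sum_{t=1}^{T} F_t(G^{t-}_{k,c} + g^{t}_{k,c}) \ge \max_{x \in P_k \times \{c\}} \left\{ \sum_{t=1}^{T} F_t(G^{t-}_{k,c} + x) \right\} - \varepsilon_{k,c} \tag{7.3} \]

where we define $F_t(S) = \mathbb{E}_{\vec{c}}\left[ f_t(\mathrm{sample}_{\vec{c}}(S)) \right]$ and $\varepsilon_{k,c} = \mathbb{E}[r_{k,c}]$. 

We now define some additional notation. For any set $\vec{S}$ of vectors in $V^T$, define 

\[ f(\vec{S}) = \sum_{t=1}^{T} f_t(\{a_t : \vec{a} \in \vec{S}\}) . \]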



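Next, for any set $\vec{S}$ of (element vector, color) pairs in $\bigcup_{k=1}^{K} P_k^T \times [C]$, define 

\[ \mathrm{sample}_{\vec{c}}(\vec{S}) = \bigcup_{k=1}^{K} \left\{ \vec{x} \in P_k^T : (\vec{x}, c_k) \in \vec{S} \right\} . \]

Define $F(\vec{S}) = \mathbb{E}_{\vec{c}}\left[ f(\mathrm{sample}_{\vec{c}}(\vec{S})) \right]$. By linearity of expectation, 

\[ F(\vec{S}) = \sum_{t=1}^{T} F_t(\{(x_t, c) : (\vec{x}, c) \in \vec{S}\}) . \tag{7.4} \]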

t=i 

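Let $\vec{g}_{k,c} = (\vec{x}, c)$, where $\vec{x}$ is such that $(x_t, c) = g^{t}_{k,c}$ for all $t \in [T]$. Analogously to $G^-_{k,c}$, define 
$\vec{G}^-_{k,c} = \{\vec{g}_{k',c'} : k' \in [K],\ c' < c\} \cup \{\vec{g}_{k',c} : k' < k\}$. By (7.4), for any $(\vec{x}, c) \in V^T \times [C]$ we have 
$F(\vec{G}^-_{k,c} + (\vec{x}, c)) = \sum_{t=1}^{T} F_t(G^{t-}_{k,c} + (x_t, c))$. Combining this with (7.3), we get 

\[ F(\vec{G}^-_{k,c} + \vec{g}_{k,c}) \ge \max_{a \in P_k} \left\{ F(\vec{G}^-_{k,c} + (\vec{a}, c)) \right\} - \varepsilon_{k,c} \tag{7.5} \]

where $\vec{a}$ is the unique element of $\{a\}^T$. Having proved (7.5), we can now use Theorem 9 to 
complete the proof. Let $\vec{P}_k = \{\vec{a} : a \in P_k\}$ for each $k \in [K]$, and define a new partition matroid 
over ground set $\{\vec{a} : a \in V\}$ with feasible solutions $\vec{\mathcal{P}} := \{\vec{S} : \forall k \in [K],\ |\vec{S} \cap \vec{P}_k| \le 1\}$. Let 
$\vec{G} = \{\vec{g}_{k,c} : k \in [K],\ c \in [C]\}$. As argued in the proof of Lemma 4, $F_t$ is monotone submodular. 
Using this fact together with (7.4), it is straightforward to show that $F$ itself is monotone submodular. 
Thus by Theorem 9, 

\[ F(\vec{G}) \ge \beta(K, C) \cdot \max_{\vec{S} \in \vec{\mathcal{P}}} \left\{ f(\vec{S}) \right\} - \sum_{k=1}^{K} \sum_{c=1}^{C} \varepsilon_{k,c} . \]

To complete the proof, it suffices to show that $F(\vec{G}) = \mathbb{E}\left[ \sum_{t=1}^{T} f_t(G_t) \right]$ and that 
$\max_{\vec{S} \in \vec{\mathcal{P}}} \{ f(\vec{S}) \} \ge \max_{S \in \mathcal{P}} \left\{ \sum_{t=1}^{T} f_t(S) \right\}$. Both facts follow easily from the definitions. □ 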
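
The sampling construction used throughout this proof can be made concrete with a small Python sketch of the single-round analogue: it computes $F(S) = \mathbb{E}_{\vec{c}}[f(\mathrm{sample}_{\vec{c}}(S))]$ exactly by enumerating every color vector. It assumes that each color $c_k$ is drawn independently and uniformly from $[C]$, which is not stated in this appendix. The toy coverage instance, the 0-based color indices, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy instance: each item covers a set of topics.
COVERS = {"a1": {1, 2}, "a2": {3}, "b1": {2, 4}, "b2": {5}}
PARTITION = [["a1", "a2"], ["b1", "b2"]]  # blocks P_1, P_2


def coverage(items):
    """Monotone submodular objective: number of distinct topics covered."""
    covered = set()
    for item in items:
        covered |= COVERS[item]
    return len(covered)


def sample(colors, S, partition):
    """sample_c(S): for each block k, keep the items x in P_k with (x, c_k) in S."""
    return {x for k, block in enumerate(partition)
            for x in block if (x, colors[k]) in S}


def F(S, partition, f, C):
    """Exact E_c[f(sample_c(S))], assuming each c_k is uniform on {0, ..., C-1}."""
    K = len(partition)
    total = 0.0
    for colors in product(range(C), repeat=K):  # all C**K color vectors
        total += f(sample(colors, S, partition))
    return total / C ** K


# A set of (element, color) pairs, as built up by a tabular greedy procedure.
S = {("a1", 0), ("a2", 1), ("b1", 0)}
value = F(S, PARTITION, coverage, C=2)
print(value)
```

Exhaustive enumeration costs $C^K$ evaluations of $f$, so this is only viable for tiny instances; in practice one would estimate the expectation by drawing random color vectors.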






